Hello and welcome to this short documentation of my
forecasting project, which was done within the context of a forecasting
theory seminar.
This forecasting project is about predicting the overall number of taxi rides in Queens, New York (over a time span of about a month, but at a daily resolution).
Here is a map of Queens and its districts:
From: NYC.gov
One could decompose this forecasting/prediction project by district and sum the parts up at the end. Decomposition/segmentation is one of the forecasting principles (Green, Graefe, & Armstrong, 2011), so I’ll try to dissect the problem into several smaller ones. Looking through the raw data itself, JFK Airport (132) and La Guardia Airport (138) have by far the most taxi rides, whereas the other regions are almost negligible on their own. But if we sum all of the other regions, we get a 3rd “region”. So I will forecast the total number of taxi rides per day by summing up the predicted taxi rides for the 3 regions.
But let’s import all the data first.
The following .rds files are the result of several importing and cleaning steps. Because the dataset itself was patched together from over 133 PARQUET files (just a different format for storing data), which takes time to read, I decided to store the cleaned data in .rds files so it is easier to access without having to rerun the same code in every session.
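The caching step could look roughly like this. A sketch only: the helper name cache_as_rds and the file paths are illustrative, not the code actually used.

```r
# Sketch of the caching step (helper name and paths are illustrative):
# read many parquet files, bind them into one dataframe, and store one
# .rds per dataset, so later sessions only need a single readRDS() call.
cache_as_rds <- function(parquet_files, rds_path) {
  combined <- parquet_files |>
    lapply(arrow::read_parquet) |>  # one dataframe per parquet file
    dplyr::bind_rows()              # stack them into one dataframe
  saveRDS(combined, rds_path)
  invisible(combined)
}

# cache_as_rds(list.files("raw_parquet", pattern = "\\.parquet$", full.names = TRUE),
#              "cleaned_datasets/taxi_rides.rds")
```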
The data was collected from several sources. Yellow-taxi data comes from NYC.gov, whereas weather data comes from the National Weather Service. As a limitation from the start: the weather data was gathered at La Guardia Airport (so at one of the zones we will be forecasting). Whether this data holds for the other zones is not known, but I would argue it is less reliable (and therefore less valid) for the other zones.
As the number of taxi rides was also influenced by the COVID-19 pandemic, I searched the internet and found a timeline of government responses (and thus their lockdown dates and other restrictions). As it is not in a usable data format, the only thing I can do is hardcode it in R as a vector and append it to the dataframe. The variable encodes the severity of restrictions, from 0 (no restrictions) to 5 (full lockdown).
This hand-coding will be a limitation as well, of course.
#read in data
taxi_rides <- readRDS("cleaned_datasets\\taxi_rides.rds")
taxi_zone <- readRDS("cleaned_datasets\\taxi_zone.rds")
max_rain <- readRDS("cleaned_datasets\\max_rain_LG_Airport.rds")
max_temp <- readRDS("cleaned_datasets\\max_temp_LG_Airport.rds")
#join temp and rain dataset, and clean them accordingly
weather_data <- max_rain %>%
full_join(max_temp, by = c("Jahr", "Monat", "Tag")) %>%
mutate(Monat = as.numeric(Monat)) %>%
select(-Max_rain_inches, -Max_Temp)
# we do not have many datapoints for February 2024, so I'll remove everything up until 31.01.2024
taxi_rides %>%
group_by(Year_PU, Month_PU) %>%
summarize(count_Days = n_distinct(Day_PU)) %>%
paged_table(options = list(rows.print = 12))
#remove days where year is 2024 but month is 2 (not enough data points to work with; I'll forecast January instead)
remove_rows <- which(taxi_rides$Year_PU == 2024 & taxi_rides$Month_PU == 2)
taxi_rides <- taxi_rides[-remove_rows, ]
remove_rows <- which(taxi_zone$Year_PU == 2024 & taxi_zone$Month_PU == 2)
taxi_zone <- taxi_zone[-remove_rows, ]
full_ride <- taxi_rides %>%
left_join(weather_data, by = c("Year_PU" = "Jahr",
"Month_PU" = "Monat",
"Day_PU" = "Tag"))
full_zone <- taxi_zone %>%
left_join(weather_data, by = c("Year_PU" = "Jahr",
"Month_PU" = "Monat",
"Day_PU" = "Tag"))
#look whether there are NA's created
sum(is.na(full_ride)); sum(is.na(full_zone))
full_zone2 <- full_zone %>% group_by(PULocation, Year_PU, Month_PU, Day_PU, Max_Temp_C, Max_rain_mm) %>%
summarise(rides_day = sum(rides)) %>%
ungroup() %>%
mutate(date = as.Date(paste(Year_PU, Month_PU, Day_PU, sep = "-")))%>%
select(-c(Year_PU, Month_PU, Day_PU))
full_ride2 <- full_ride %>% group_by(Year_PU, Month_PU, Day_PU, Max_Temp_C, Max_rain_mm) %>%
summarise(rides_day = sum(rides)) %>%
ungroup() %>%
mutate(date = as.Date(paste(Year_PU, Month_PU, Day_PU, sep = "-"))) %>%
select(-c(Year_PU, Month_PU, Day_PU))
# create vector where we have the "lockdown" data as specified in the previous section
lockdown_dat <- data.frame(date = full_ride2$date) %>%
mutate(lock_down = case_when(
between(date, as.Date("2020-03-07"), as.Date("2020-06-07")) ~ 5,
between(date, as.Date("2020-06-08"), as.Date("2020-06-22")) ~ 4,
between(date, as.Date("2020-06-23"), as.Date("2020-07-06")) ~ 3,
between(date, as.Date("2020-07-07"), as.Date("2020-07-20")) ~ 2,
between(date, as.Date("2020-07-21"), as.Date("2021-05-21")) ~ 1,
TRUE ~ 0
))
full_ride2 <- full_ride2 %>%
full_join(lockdown_dat, by = "date")
full_zone2 <- full_zone2 %>%
full_join(lockdown_dat, by = "date")
Let’s start by turning the dataframes into tsibbles (time-series tibbles):
zone <- full_zone2 %>% as_tsibble(key = PULocation, index = date)
rm(list = setdiff(ls(),c("zone", "full_ride2")))
zone
## # A tsibble: 12,144 x 6 [1D]
## # Key: PULocation [3]
## PULocation Max_Temp_C Max_rain_mm rides_day date lock_down
## <chr> <dbl> <dbl> <int> <date> <dbl>
## 1 JFK Airport 5 0 7898 2013-01-01 0
## 2 JFK Airport 1.11 0 9898 2013-01-02 0
## 3 JFK Airport 1.11 0 8773 2013-01-03 0
## 4 JFK Airport 3.89 0 7565 2013-01-04 0
## 5 JFK Airport 6.11 0 7797 2013-01-05 0
## 6 JFK Airport 8.33 0 8305 2013-01-06 0
## 7 JFK Airport 7.22 0 9019 2013-01-07 0
## 8 JFK Airport 9.44 0 7128 2013-01-08 0
## 9 JFK Airport 10.6 0 6846 2013-01-09 0
## 10 JFK Airport 8.89 0 7183 2013-01-10 0
## # ℹ 12,134 more rows
We have data ranging from 2013-01-01 to
2024-01-31, so 4048 days in total.
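As a quick arithmetic check, the inclusive day count over that span can be verified directly in R:

```r
# Sanity check on the span of the series: inclusive day count from
# 2013-01-01 to 2024-01-31 (the leap days of 2016 and 2020 fall inside).
n_days <- as.numeric(as.Date("2024-01-31") - as.Date("2013-01-01")) + 1
n_days  # 4048
```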
In this section, we will simply plot our metric of interest, that is,
taxi rides on a given day, over time.
Two views are shown: the number of taxi rides split by the 3 zones,
and the total taxi rides (which is just the sum over the 3
zones).
In both cases, we can see that the number of taxi rides dipped at the
beginning of 2020, when the pandemic hit the USA. Overall, we
can also see a slight downward trend from 2013 to the beginning of
2020, and a slight increase after 2022.
Here is a plot where we can see the total amount of rides per zone:
zone %>% autoplot(rides_day, alpha = .5)+
labs(y = "Taxi Rides")
Here is a plot which gives us a feel for the total number of taxi rides over time (this is what we want to forecast):
total <- zone %>% index_by(date) %>%
summarize(Total= sum(rides_day))
total %>% autoplot(Total)+
labs(y = "Total Taxi Rides")+
theme_minimal()
If we want similar variances in each time period (and thus similar seasonal patterns over time), we can use transformations to achieve that goal. The Box-Cox transformation mixes logarithms and power transformations and depends on a parameter \(\lambda\). We can use the features function to search for the optimal \(\lambda\)-value for our case.
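For reference, the Box-Cox transformation (Box & Cox, 1964) of a series \(y_t\) is defined as

\[
w_t =
\begin{cases}
\log(y_t) & \text{if } \lambda = 0,\\[4pt]
\left(y_t^{\lambda} - 1\right)/\lambda & \text{otherwise,}
\end{cases}
\]

so \(\lambda = 1\) only shifts the series, while \(\lambda \to 0\) approaches the logarithm.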
lambda <- total %>%
features(Total, features = guerrero) %>%
pull(lambda_guerrero)
total <- total %>%
mutate(ride_tr = box_cox(Total, lambda))
total %>%
autoplot(ride_tr)+
labs(y = "Transformed Total Taxi Rides")+
theme_minimal()
As this plot seems to have better properties, we will use the
transformed data from here on out, but will transform back whenever
interpretable results are needed.
The only concern I have going forward is that the data still contains large
“outliers” of sorts, where the number of taxi rides dips sharply and then goes
back to normal… Maybe these were due to malfunctions during data collection
for most taxis?
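One simple way to flag such one-day dips is to compare each day against a centered rolling median. This is only a sketch on synthetic data (the series below is simulated, not the actual taxi data); on the real data the same idea would be applied to the daily totals.

```r
# Sketch: flag days that fall far below the local level, using a
# centered rolling median. 'rides' is synthetic stand-in data.
set.seed(1)
rides <- rpois(200, lambda = 100)  # stand-in daily ride counts
rides[100] <- 10                   # artificial one-day dip

roll_med <- stats::runmed(rides, k = 29)  # 29-day rolling median
flagged  <- rides < 0.5 * roll_med        # "dip" = below half the local level
which(flagged)
```

Flagged days could then be inspected manually or treated as outliers during modeling.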
Instead of only plotting the data, one can also summarize it with descriptive statistics. Here is a table with simple statistics for the 3 zones:
zone %>%
features(rides_day, list(mean = mean, sd = sd, median = median, min = min, max = max))
## # A tibble: 3 × 6
## PULocation mean sd median min max
## <chr> <dbl> <dbl> <dbl> <int> <int>
## 1 JFK Airport 6293. 2520. 6917 16 12140
## 2 La Guardia Airport 6307. 3698. 6152. 7 17407
## 3 Other 3094. 1927. 2928 62 11247
So during the COVID pandemic, the minimum number of taxi rides on a single day was 7 (at La Guardia Airport).
zone %>%
features(rides_day, quantile)
## # A tibble: 3 × 6
## PULocation `0%` `25%` `50%` `75%` `100%`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 JFK Airport 16 5209 6917 8024. 12140
## 2 La Guardia Airport 7 3222. 6152. 9497. 17407
## 3 Other 62 1286 2928 4300. 11247
In the following series of plots, we will examine whether there are underlying trends and seasonality in our data. Plotting is a useful tool for grasping your data, as opposed to scrolling through your whole data table.
In this plot, we can see the yearly pattern for each of the 3 zones. As mentioned previously, we do see a decrease in taxi rides year over year. Interestingly, the two airports were only affected between early 2020 and the end of 2021, whereas the “Other” zone took longer to recover to pre-lockdown levels.
zone %>%
gg_season(rides_day, labels = "right", alpha = .5)+
theme_minimal()+
labs(y= "Taxi Rides")
Again, we can see the decline in taxi rides with each additional year; the series then finds an equilibrium around 100’000 taxi rides per day during 2022 and 2023.
total %>%
gg_season(ride_tr, labels = "right", period = "year")+
labs(y = "Transformed Total Taxi Rides",
title = TeX(paste0("Box Cox Transformed Taxi Rides with $\\lambda$ = ", round(lambda, 2))))+
theme_minimal()
This is a lag plot, where each lag is an offset of one year. We should see around 365 points in each panel: the y-axis shows the total taxi rides on a day, whereas the x-axis shows the taxi rides lagged by lag_x years. Each point thus compares a day’s rides with the rides on the same day x years earlier. If the points all aligned on the diagonal, we would have a perfect correlation (and forecasting would therefore be easy, as knowing the quantity on one day could be used to estimate it x years later). We can also see that higher lags come with more spread, which makes sense, as correlations over a greater time span are lower than those that are more proximate. In other words, time-series data from several years back is less reliable for forecasting than data from one year ago.
total %>%
gg_lag(y = ride_tr, lags = 1:16, period = "year", geom = "point")+
theme_minimal()+
theme(legend.position = "none")
Or a lag plot for different weekdays:
total %>%
gg_lag(y = ride_tr, lags = 1:16, geom = "point", period = "weeks")+
theme(legend.position = "none")+
theme_minimal()
In this case, while all the weekday lags correlate strongly, we can also see the differently colored “lines” in this plot: correlations are higher when comparing the same weekday.
Here are the correlations for each lag, for those who are interested:
total %>%
ACF(y = Total, lag_max = 1000)
## # A tsibble: 1,000 x 2 [1D]
## lag acf
## <cf_lag> <dbl>
## 1 1D 0.934
## 2 2D 0.901
## 3 3D 0.924
## 4 4D 0.922
## 5 5D 0.895
## 6 6D 0.910
## 7 7D 0.944
## 8 8D 0.908
## 9 9D 0.891
## 10 10D 0.915
## # ℹ 990 more rows
total %>%
ACF(y = Total, lag_max = 365) %>%
autoplot()+
theme_minimal()+
theme(axis.minor.ticks.length = element_blank(),
axis.minor.ticks.theta = element_blank(),
axis.minor.ticks.r = element_blank(),
panel.grid.major = element_blank())
As we can see, there is a trend (the autocorrelation decays slowly over time), but there are still recurring spikes here and there, which indicates seasonality.
We can decompose our time series into different parts: trend, seasonality, and a remainder. Here is a table of the components after seasonal-trend decomposition using Loess (STL; Cleveland et al., 1990).
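In the additive case, the decomposition splits the (here Box-Cox transformed) series into

\[
y_t = T_t + S_t + R_t,
\]

where \(T_t\) is the trend-cycle component, \(S_t\) the seasonal component, and \(R_t\) the remainder.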
decomp <- total %>%
model(stl = STL(ride_tr))
components(decomp) %>%
mutate_if(.predicate = is.numeric, .funs = ~round(., digits = 2))%>%
paged_table()
We can also describe the strength of the trend and seasonality:
zone %>%
features(rides_day, feat_stl)
## # A tibble: 3 × 10
## PULocation trend_strength seasonal_strength_week seasonal_peak_week
## <chr> <dbl> <dbl> <dbl>
## 1 JFK Airport 0.962 0.606 0
## 2 La Guardia Airport 0.953 0.753 0
## 3 Other 0.975 0.823 5
## # ℹ 6 more variables: seasonal_trough_week <dbl>, spikiness <dbl>,
## # linearity <dbl>, curvature <dbl>, stl_e_acf1 <dbl>, stl_e_acf10 <dbl>
We can also see that, for the “Other” group, weekly seasonality is stronger than for the two airports. Here is a plot of the trend component (in yellow), with the data transformed back to its original scale.
components(decomp) %>%
as_tsibble() %>%
autoplot(inv_box_cox(`ride_tr`, lambda)) +
geom_line(aes(y = inv_box_cox((trend),lambda)), color = "yellow")
components(decomp) %>%
autoplot()
Before modeling our data, we should have a benchmark that our models need to surpass. For this, we create 4 different benchmark models and see which one performs best.
set.seed(42)
# cutoff date: the first ~90% of days form the training set
n_train <- min(total$date) + round(nrow(total)* .9)
# forecast horizon (in days): the remaining ~10%
n_test <- nrow(total) - round(nrow(total)* .9)
ride_train <- total %>%
filter(date <= n_train)
ride_fit <- ride_train %>%
model(`Seas Naive` = SNAIVE(box_cox(Total, lambda)),
`Mean` = MEAN(box_cox(Total, lambda)),
Naive = NAIVE(box_cox(Total, lambda)),
Drift = RW(box_cox(Total, lambda) ~ drift())
)
ride_fc <- ride_fit %>%
forecast(h = n_test) %>%
filter(date <= "2024-01-31")
#ride_fc <- ride_fc %>% mutate(Total = inv_box_cox(.mean, lambda))
ride_fc %>%
autoplot(level =NULL, linewidth = .9, alpha = .8)+
autolayer(filter_index(total, "2022-01-01" ~ . ))+
scale_color_manual(values = my_palette)+
theme_minimal()
In this case, the naive and drift models were bad, the mean is acceptable, and the seasonal naive looks best.
accuracy(ride_fc, total)%>%
mutate_if(is.numeric, .funs = ~round(., digits = 2))
## # A tibble: 4 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Drift Test -23666. 28092. 23748. -511. 511. 15.2 11.2 0.99
## 2 Mean Test -6517. 6782. 6517. -173. 173. 4.18 2.71 0.56
## 3 Naive Test -23715. 28067. 23796. -511. 512. 15.3 11.2 0.99
## 4 Seas Naive Test -2861. 4102. 3094. -119. 121. 1.99 1.64 0.83
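For reference, the two headline metrics in this table, with \(y_t\) the observed and \(\hat{y}_t\) the forecast value over \(n\) test days, are the standard definitions:

\[
\text{MAPE} = \frac{100}{n}\sum_{t=1}^{n}\left|\frac{y_t - \hat{y}_t}{y_t}\right|,
\qquad
\text{MASE} = \frac{\frac{1}{n}\sum_{t=1}^{n}\left|y_t - \hat{y}_t\right|}{\text{MAE of the in-sample (seasonal) naive forecast}}.
\]

A MASE above 1 means the model's test error is larger than the in-sample error of the naive baseline it is scaled by.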
Let’s train on 90% of the data, and test on the remaining 10%.
set.seed(42)
n_train <- min(total$date) + round(nrow(total)* .9)
n_test <- nrow(total) - round(nrow(total)* .9)
ride_train <- total %>%
filter(date <= n_train)
ride_fit <- ride_train %>%
model(`Seas Naive` = SNAIVE(box_cox(Total, lambda)),
ARIMA_search = ARIMA(box_cox(Total, lambda)),
ARIMA_spec111 = ARIMA(box_cox(Total, lambda) ~ 1 + pdq(1,1,1)),
sARIMA = ARIMA(box_cox(Total, lambda) ~ 1 + pdq(3,1,3) + PDQ(1, 0, 0)),
`Exp_smooth_with_damp` = ETS(box_cox(Total, lambda) ~ error("A") + trend("A", phi = 0.4) + season("N")),
`sExp_smooth_with_damp` = ETS(box_cox(Total, lambda) ~ error("A") + trend("A", phi = 0.4) + season("A"))
)
ride_fc <- ride_fit %>%
forecast(h = n_test) %>%
filter(date <= "2024-01-31")
#ride_fc <- ride_fc %>% mutate(Total = inv_box_cox(.mean, lambda))
ride_fc %>%
autoplot(level =NULL, linewidth = .9, alpha = .8)+
autolayer(filter_index(total, "2022-01-01" ~ . ))+
scale_color_manual(values = my_palette)
pdq in ARIMA stands for:
- p: autoregressive order (the number of lagged observations included in the
model)
- d: differencing order (the number of times the data needs to be
differenced to remove trends and stabilize the mean)
- q: moving-average order (the number of lagged forecast errors included in
the model)
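As a concrete example, the ARIMA(1,1,1) specification fitted above can be written with the backshift operator \(B\) as

\[
(1 - \phi_1 B)(1 - B)\, y_t = c + (1 + \theta_1 B)\,\varepsilon_t,
\]

where \(\phi_1\) is the autoregressive coefficient, \(\theta_1\) the moving-average coefficient, \((1 - B)\) the single difference, and \(\varepsilon_t\) white noise.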
accuracy(ride_fc, total)%>%
mutate_if(is.numeric, .funs = ~round(., digits = 2))
## # A tibble: 6 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ARIMA_search Test -114. 2128. 1572. -71.7 84.1 1.01 0.85 0.66
## 2 ARIMA_spec111 Test 548. 2007. 1553. -58.9 76.9 1 0.8 0.59
## 3 Exp_smooth_with_damp Test 168. 1981. 1488. -65.4 79.4 0.96 0.79 0.6
## 4 Seas Naive Test -2861. 4102. 3094. -119. 121. 1.99 1.64 0.83
## 5 sARIMA Test 634. 1906. 1463. -56.1 74.2 0.94 0.76 0.68
## 6 sExp_smooth_with_damp Test 280. 1818. 1332. -60.4 74.4 0.86 0.73 0.76
Our ARIMA search model overfitted a bit, whereas our conservative ARIMA(1,1,1) approach fared better.
So our seasonal exponential smoothing model and seasonal ARIMA were best, but they still perform poorly: a mean absolute percentage error (MAPE) of 74 is very bad. This is probably due to the long period we are trying to predict. Let’s see whether this changes when we only predict the next month’s worth of taxi rides instead of 13 months.
This uses the same pipeline, but this time on the transformed taxi-ride data.
set.seed(42)
n_train <- min(total$date) + round(nrow(total)* .9)
n_test <- nrow(total) - round(nrow(total)* .9)
ride_train <- total %>%
filter(date <= n_train)
ride_fit_tr <- ride_train %>%
model(`Seas Naive` = SNAIVE(ride_tr),
ARIMA_search = ARIMA(ride_tr),
ARIMA_spec111 = ARIMA(ride_tr ~ 1 + pdq(1,1,1)),
sARIMA = ARIMA(ride_tr ~ 1 + pdq(3,1,3) + PDQ(1, 0, 0)),
`Exp_smooth_with_damp` = ETS(ride_tr ~ error("A") + trend("A", phi = 0.4) + season("N")),
`sExp_smooth_with_damp` = ETS(ride_tr ~ error("A") + trend("A", phi = 0.4) + season("A"))
)
ride_fc_tr <- ride_fit_tr %>%
forecast(h = n_test) %>%
filter(date <= "2024-01-31")
#ride_fc <- ride_fc %>% mutate(Total = inv_box_cox(.mean, lambda))
ride_fc_tr %>%
autoplot(level =NULL, linewidth = .9, alpha = .8)+
autolayer(filter_index(total, "2022-10-01" ~ . ), .vars = ride_tr)+
scale_color_manual(values = my_palette)
Please bear in mind that our dataset spans several years and we only
forecast 405 days, so the plot above shows just a snippet
of our actual data (in this case, the relevant time frame). The data is
still in transformed form.
We can also quantify the forecasting quality by computing and comparing accuracy metrics:
accuracy(ride_fc_tr, total) %>%
mutate_if(is.numeric, .funs = ~round(., digits = 2))
## # A tibble: 6 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ARIMA_search Test 1.59 3.49 2.4 2.3 7.78 1.93 1.63 0.65
## 2 ARIMA_spec111 Test 2.01 3.7 2.7 3.43 8.49 2.16 1.73 0.66
## 3 Exp_smooth_with_damp Test 1.61 3.5 2.42 2.35 7.82 1.94 1.64 0.65
## 4 Seas Naive Test 1.01 3.21 2.03 0.8 6.88 1.63 1.51 0.67
## 5 sARIMA Test 1.95 3.6 2.61 3.31 8.26 2.09 1.68 0.71
## 6 sExp_smooth_with_damp Test 2.23 3.66 2.8 4.11 8.73 2.25 1.72 0.73
For the transformed data, the seasonal naive (our benchmark model) is now best. Probably because the transformed data has much less variance, a “straight line” works well. What about a shorter time period?
set.seed(42)
n_train <- max(total$date) - 90
n_test <- 90
ride_train <- total %>%
filter(date <= n_train)
ride_fit <- ride_train %>%
model(`Seas Naive` = SNAIVE(ride_tr),
#ARIMA_search = ARIMA(ride_tr),
#ARIMA_spec111 = ARIMA(ride_tr ~ 1 + pdq(1,1,1)),
sARIMA = ARIMA(ride_tr ~ 1 + pdq(3,1,3) + PDQ(1, 0, 0)),
`Exp damp` = ETS(ride_tr ~ error("A") + trend("A", phi = 0.4) + season("N")),
`sExp damp` = ETS(ride_tr ~ error("A") + trend("A", phi = 0.4) + season("A"))
)
ride_fc <- ride_fit %>%
forecast(h = n_test) %>%
filter(date <= "2024-01-31")
#ride_fc <- ride_fc %>% mutate(Total = inv_box_cox(.mean, lambda))
ride_fc %>%
autoplot(level =NULL, linewidth = 1.2, alpha = .9)+
autolayer(filter_index(total, paste(n_train-2) ~ . ), .vars = ride_tr, linewidth = 1.3)+
scale_color_manual(values = my_palette)+
theme_minimal()+
theme(legend.text = element_text(size = 10),
legend.title = element_blank(),
legend.position = "top",
legend.direction = "horizontal",
axis.text.x = element_text(size = 10))+
guides(color = guide_legend(override.aes = list(size = 8, linewidth = 5)))+
labs(y = "Transformed Taxi Rides")
accuracy(ride_fc, total) %>%
mutate_if(is.numeric, .funs = ~round(., digits = 2))
## # A tibble: 4 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Exp damp Test -0.77 2.21 1.69 -2.37 4.64 1.32 0.93 0.52
## 2 Seas Naive Test -0.79 2.42 1.85 -2.39 5.04 1.45 1.02 0.52
## 3 sARIMA Test -0.82 2.14 1.63 -2.48 4.5 1.28 0.9 0.59
## 4 sExp damp Test -0.52 2.04 1.53 -1.62 4.17 1.2 0.86 0.62
Now the seasonal exponential smoothing model was best, followed by the seasonal ARIMA model, and then the damped exponential smoothing one. All of them fared better than the benchmark model.
But what about the residuals? Do these models fit the data equally well at all time points?
aug <- augment(ride_fit)
aug %>%
autoplot(.innov)+
theme_minimal()+
facet_grid(rows = vars(.model))
aug %>%
ggplot(aes(x = .innov))+
geom_histogram()+
facet_grid(rows = vars(.model))+
theme_minimal()
aug %>%
ACF(.innov) %>%
autoplot()
All of the residual plots above indicate that the models are fairly fine. Some of them probably even fit the data too well… Bear in mind that these are computed on the training data only, so small residuals are to be expected.
Let’s plot all of the residual plots for our best model:
ride_train %>%
model(`sExp damp` = ETS(ride_tr ~ error("A") + trend("A", phi = 0.4) + season("A"))) %>%
gg_tsresiduals()
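Beyond the plots, the Ljung-Box test gives a formal check for leftover autocorrelation; in the fpp3 workflow this would be augment(ride_fit) %>% features(.innov, ljung_box, lag = 14). Base R's Box.test() implements the same test; here is a self-contained sketch on white-noise stand-in residuals (not the actual model residuals):

```r
# Ljung-Box test on stand-in residuals. For white noise we expect no
# significant autocorrelation, i.e. typically a non-small p-value.
set.seed(42)
resid_like <- rnorm(300)  # stand-in for the model's innovation residuals
bt <- stats::Box.test(resid_like, lag = 14, type = "Ljung-Box")
bt$p.value
```

A small p-value on the real residuals would indicate remaining autocorrelation the model failed to capture.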
So far, we’ve only forecasted the total number of rides in Queens. But as mentioned in the introduction, we can also fit a model for each zone and then combine them. Let us see whether we get a good forecast.
set.seed(42)
n_train <- max(total$date) - 90
n_test <- 90
zone_train <- zone %>%
group_by(PULocation) %>%
filter(date <= n_train)
zone_fit <- zone_train %>%
model(
`Seas Naive` = SNAIVE(rides_day),
`sExp damp` = ETS(rides_day ~ error("A") + trend("A", phi = 0.1) + season("A"))
)
zone_fc <- zone_fit %>%
forecast(h = n_test)
combined_zone_fc <- zone_fc %>%
as_tsibble() %>%
group_by(.model) %>%
index_by(date) %>%
summarize(tot_pred = sum(.mean), .groups = "drop") %>%
as_tsibble()
combined_zone_fc %>%
autoplot(tot_pred, level = NULL, linewidth = 1.2, alpha = .8)+
autolayer(filter_index(total, paste(n_train-2) ~. ), .vars = Total, linewidth = 1.3)+
theme_minimal()+
scale_color_manual(values = my_palette)
We can also use time-series data to fit linear regressions. As always, the stronger the correlations between predictors and outcome, the better.
First of all, let’s see whether we can use our weather and lockdown data to “describe” the rides with a linear regression.
total %>%
left_join(full_ride2, by = "date") %>%
model(tslm_weather_lockdown = TSLM(Total ~ Max_Temp_C + lock_down + Max_rain_mm),
tslm_weather = TSLM(Total ~ Max_Temp_C + Max_rain_mm)) %>%
report()
## # A tibble: 2 × 15
## .model r_squared adj_r_squared sigma2 statistic p_value df log_lik AIC
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl>
## 1 tslm_… 0.246 0.246 4.20e7 441. 1.19e-247 4 -41272. 71066.
## 2 tslm_… 0.00258 0.00208 5.56e7 5.23 5.40e- 3 3 -41839. 72198.
## # ℹ 6 more variables: AICc <dbl>, BIC <dbl>, CV <dbl>, deviance <dbl>,
## # df.residual <int>, rank <int>
While the first result may seem quite good, it does not generalize well: most of the predictive power comes from the lockdown variable, which captures a rare event and is only helpful during a pandemic. Do not overestimate this finding, as this variable is useless outside of the pandemic…
Looking only at the R-squared value, our weather data does not seem very predictive (even though the literature suggests a sizeable effect). One possible reason is that our data lacks granularity (e.g., with hourly data, weather variables such as rain might be predictive).
But what about using the taxi rides from one airport to predict the number of taxi rides at the other?
zone %>%
pivot_wider(names_from = PULocation,
values_from = rides_day) %>%
rowwise() %>%
mutate(Total = sum(`JFK Airport`, `La Guardia Airport`, Other)) %>%
as_tsibble(index = date) %>%
model(LG_JFK = TSLM(`La Guardia Airport` ~ `JFK Airport`),
LG_JFK_weather = TSLM(`La Guardia Airport` ~ `JFK Airport` + Max_Temp_C + Max_rain_mm),
LG_JFK_weather_lockdown = TSLM(`La Guardia Airport` ~ `JFK Airport` + Max_Temp_C + Max_rain_mm + lock_down),
LG_weather_only = TSLM(`La Guardia Airport` ~ Max_Temp_C + Max_rain_mm)) %>%
report()
## # A tibble: 4 × 15
## .model r_squared adj_r_squared sigma2 statistic p_value df log_lik AIC
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl>
## 1 LG_JFK 0.740 0.740 3.55e6 11529. 0 2 -36272. 61062.
## 2 LG_JFK_… 0.744 0.744 3.50e6 3919. 0 4 -36241. 61005.
## 3 LG_JFK_… 0.756 0.756 3.34e6 3129. 0 5 -36146. 60816.
## 4 LG_weat… 0.00342 0.00293 1.36e7 6.95 9.71e-4 3 -38993. 66506.
## # ℹ 6 more variables: AICc <dbl>, BIC <dbl>, CV <dbl>, deviance <dbl>,
## # df.residual <int>, rank <int>
We can see that with this approach we get a very good model. Adding weather and lockdown data was barely helpful, as these effects are already reflected in the JFK ride counts.
Box, G. E. P., & Cox, D. R. (1964). An analysis of transformations. Journal of the Royal Statistical Society. Series B, Statistical Methodology, 26(2), 211–252. https://doi.org/10.1111/j.2517-6161.1964.tb00553.x
Cleveland, R. B., Cleveland, W. S., McRae, J. E., & Terpenning, I. J. (1990). STL: A seasonal-trend decomposition procedure based on loess. Journal of Official Statistics, 6(1), 3–33. http://bit.ly/stl1990
Green, K. C., Graefe, A., & Armstrong, J. S. (2011). Forecasting principles. In M. Lovric (Ed.), International Encyclopedia of Statistical Science. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-04898-2_257